library(phyloseq)
library(ggplot2)
library(gridExtra)
library(Biostrings)
library(DECIPHER)
library(phangorn)
library(plyr)
load(params$DATA)
ls()
## [1] "params"       "SeqTabNoChim" "taxa"

Naming sequencies

Above I loaded the taxonomy data, including the table of abundances (SeqTabNoChim) and the assigned taxonomy (taxa). I name all sequences with a taxonomy attribution.

dna <- DNAStringSet(row.names(taxa))
names(dna) <- sprintf("A%04i", seq(dim(taxa)[1]))
row.names(taxa)        <- names(dna)[match(row.names(taxa), dna)] 
colnames(SeqTabNoChim) <- names(dna)[match(colnames(SeqTabNoChim), dna)]

Sample data

SampleData <- data.frame(
  age = factor(substr(rownames(SeqTabNoChim), 1, 1), levels=c('E','L'), labels=c('Early','Late')),
  isoline = factor(stringr::str_extract(rownames(SeqTabNoChim), "[[:digit:]]+"),
                   levels=as.character(c(6,10,11,12,14,15,17,19,20,22,23,24,25,26,27,28,29,30,31,33,35,36,38,39)),
                   ordered=TRUE),
  replicate = gsub("([EL][[:digit:]]+|.fastq.gz)", "", rownames(SeqTabNoChim)),
  seqrun = factor(1, levels=c(1,2))
  #simpson = vegan::diversity(SeqTabClean, index='simpson', MARGIN=1),
  #shannon = vegan::diversity(SeqTabClean, index='shannon', MARGIN=1)
)
row.names(SampleData) <- rownames(SeqTabNoChim)
run2 <- paste(substring(SampleData$age, 1, 1),
              SampleData$isoline,
              SampleData$replicate,
              '_R1.fastq.gz', sep='') %in% dir(params$RUN2)
SampleData[run2, 'seqrun'] <- 2
rm(run2)

Additional filtering

Recall the SeqTabNoChim matrix of abundances has been filtered before: only sequence lengths in the range 5, 5 are present, and sequences of unknown or unreliable Phylum have been removed (though they are still present in taxa). Following (Callahan et al. 2016), I take a look at prevalence (number of samples a taxon is observed at least once), and its relationship with total abundance.

Prevalence <- data.frame(
  Prev.Early = colSums(SeqTabNoChim[SampleData$age %in% 'Early',] > 0),
  Prev.Late  = colSums(SeqTabNoChim[SampleData$age %in% 'Late',] > 0),
  Abun.Early = colSums(SeqTabNoChim[SampleData$age %in% 'Early',]),
  Abun.Late  = colSums(SeqTabNoChim[SampleData$age %in% 'Late',]),
  prevalence = colSums(SeqTabNoChim > 0),   # taxa are columns
  abundance  = colSums(SeqTabNoChim),
  Phylum     = factor(taxa[colnames(SeqTabNoChim), 'Phylum']),
  Order      = factor(taxa[colnames(SeqTabNoChim), 'Order'])
)
Proportions <- apply(SeqTabNoChim, 1, function(x) x/sum(x))
Prevalence$meanProp <- apply(Proportions, 1, mean)
Prevalence$meanPropEarly <- apply(Proportions[, SampleData$age %in% 'Early'], 1, mean)
Prevalence$meanPropLate  <- apply(Proportions[, SampleData$age %in% 'Late'],  1, mean)
ggplot(Prevalence, aes(x=meanProp, y=prevalence)) +
  geom_point(size=0.5) + scale_x_log10() + facet_wrap(~Phylum)

#ggplot(Prevalence[Prevalence$Phylum=='Proteobacteria',], aes(x=meanProp, y=prevalence, color=Order)) +
#  geom_point() + scale_x_log10()
ggplot(Prevalence[Prevalence$Phylum=='Proteobacteria',], aes(x=meanPropEarly, y=Prev.Early, color=Order)) +
  geom_point() + scale_x_log10() + ggtitle('Proteobacteria in early samples')

ggplot(Prevalence[Prevalence$Phylum=='Proteobacteria',], aes(x=meanPropLate, y=Prev.Late, color=Order)) +
  geom_point() + scale_x_log10() + ggtitle('Proteobacteria in late samples')

ggplot(Prevalence, aes(x=meanPropEarly, y=meanPropLate)) + geom_point() +
  scale_x_log10() + scale_y_log10() + geom_abline(slope=1, intercept=0, color='red')

In the last plot, I see among the most abundant amplicons, many that keep the same relative abundance in early and late times, some that are more abundant late than early, and some that are more abundant early.

In the preprocessing tutorial some filters are suggested that can be applied to a phyloseq object. Here, I filter sequences before creating the phyloseq object, because that way the alignment and phylogenetic tree reconstruction steps will be faster.

(Callahan et al. 2016) suggests to filter by prevalence. They remove taxa that are not present in at least 5% of samples. However, the preprocessing tutorial suggests to filter by mean relative abundance, using \(10^{-5}\) as the threshold.

On a first run, I imposed the thresholds on global measures of mean proportion and prevalence. However, I realize that early and late samples should be allowed to have different compositions. I use now a filter that respects the independence of early and late samples.

filter <- (Prevalence$meanPropEarly >= 1.0e-05 | Prevalence$meanPropLate >= 1.0e-05) & 
  (Prevalence$Prev.Early > 10 | Prevalence$Prev.Late > 10)
AmpFilt <- sort(colnames(SeqTabNoChim)[filter])
SeqTabClean <- SeqTabNoChim[,filter]
dim(SeqTabClean)
## [1]  177 1236

Alignment and tree

I had aligned them before, on 2019-12-12. But for the sake of clarity, I align them here as well, in order to create a phylogenetic tree and include that information in the phyloseq object. I align only the filtered sequences.

Below, I use conditional execution as a home made substitute for the cache=TRUE option, which does not seem to work well sometimes.

if (file.exists('alignment.RData')) {
   load('alignment.RData')
} else {
   alignment <- AlignSeqs(RNAStringSet(dna[AmpFilt]))
   BrowseSeqs(alignment, htmlFile='alignment.html', openURL=FALSE)
   save(alignment, file='alignment.RData')
   # Forces removal of tree upon update of alignment:
   if (file.exists('fitGTR.RData')) file.remove('fitGTR.RData')
}

Next few lines mostly copied from (Callahan et al. 2016). I suppose fitting the tree by maximum likelihood may take a while. This is one of the chunks where I would use the cache=TRUE option to prevent it from running if the results are saved (and neither the input nor the code changed since the time the results were saved). However, the chunk’s cache does not seem to work properly. It may be an RStudio issue. I resort to save the results explicitly, and make the chunk’s execution conditional on their absence. I will have to remove the results manually if I want them updated.

if (file.exists('fitGTR.RData')) {
   load('fitGTR.RData')
} else {
   phang.align <- phyDat(as(DNAStringSet(alignment), "matrix"), type="DNA")
   dm <- dist.ml(phang.align)
   treeNJ <- NJ(dm) # Note, tip order != sequence order
   fit = pml(treeNJ, data=phang.align)

   fitGTR <- update(fit, k=4, inv=0.2)
   fitGTR <- optim.pml(fitGTR, model="GTR", optInv=TRUE, optGamma=TRUE,
                      rearrangement = "stochastic", control = pml.control(trace = 0))
   detach("package:phangorn", unload=TRUE)   # avoids name clashes.
   save(fitGTR, file='fitGTR.RData')
}

Phyloseq object

Now, I start using the phyloseq package (McMurdie and Holmes 2013). Row names in the matrix of abundances SeqTabClean contains the fastq file names corresponding to each library, where the three variables identifying the samples are encoded, namely the sampling time (or fly age), the isoline, and the replicate.

SampleData$simpson <- vegan::diversity(SeqTabClean, index='simpson', MARGIN=1)
SampleData$shannon <- vegan::diversity(SeqTabClean, index='shannon', MARGIN=1)

ps <- phyloseq(otu_table(SeqTabClean, taxa_are_rows=FALSE), 
               sample_data(SampleData), 
               tax_table(taxa[AmpFilt,]),
               phy_tree(fitGTR$tree))
ps <- merge_phyloseq(ps, dna[AmpFilt])
ps
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 1236 taxa and 177 samples ]
## sample_data() Sample Data:       [ 177 samples by 6 sample variables ]
## tax_table()   Taxonomy Table:    [ 1236 taxa by 7 taxonomic ranks ]
## phy_tree()    Phylogenetic Tree: [ 1236 tips and 1234 internal nodes ]
## refseq()      DNAStringSet:      [ 1236 reference sequences ]
plot_richness(ps, x="isoline", measures=c("Shannon", "Simpson"), color="age", shape="seqrun")

To generate a phyloseq object with relative abundance (proportions), I prefer to create it de novo from the PropClean matrix created before, instead of using the transform_sample_counts function. The reason is that I want to use the original, true values of relative abundance, and not the proportions calculated on a subset of the original taxa.

ps.prop <- phyloseq(otu_table(Proportions[AmpFilt,], taxa_are_rows=TRUE),
                    sample_data(SampleData),
                    tax_table(taxa[AmpFilt,]),
                    phy_tree(fitGTR$tree))

save(ps, ps.prop, file='ps.RData')

ggplot(psmelt(ps.prop), aes(x=age, y=Abundance)) +
  geom_violin() + facet_wrap(~Phylum) + scale_y_log10()

ps.prot <- subset_taxa(ps.prop, Phylum %in% c("Proteobacteria"))
ggplot(psmelt(ps.prot), aes(x=age, y=Abundance)) +
  geom_violin() + facet_wrap(~Order) + scale_y_log10()

ord.nmds.bray <- ordinate(ps.prop, method="NMDS", distance="bray", trymax=1000, trace=0)
# The "trace=0" above supresses the long output.
plot_ordination(ps.prop, ord.nmds.bray, color="age", title="Bray NMDS")

Reduced dataset

As mentioned before, a more aggressive filtering can be applied to remove taxa that have not been observed a minimum number of times (4) in a minimum number of samples (20%). I show the code below, but don’t run it now. Previous runs suggest that reducing the dataset so much removes the main differences between early and late communities.

ps.clean2 <- filter_taxa(ps, function(x) sum(x > 3) > (0.2*length(x)), TRUE)
total <- median(sample_sums(ps.clean2))
standf <- function(x, t=total) round(t * (x / sum(x)))
ps.clean2 <- transform_sample_counts(ps.clean2, standf)
# Now abundances are standardized to median sequencing depth.
ps.clean2
# In phyloseq tutorial they reverse the "<" operator, effectively keeping only
# taxa with high CV, for plotting purposes. In any case, the filter below does
# not remove much, after the previous steps. 
ps.clean2 <- filter_taxa(ps.clean2, function(x) sd(x)/mean(x) <= 3.5, TRUE)
ps.clean2
table(ps.clean2@tax_table@.Data[,'Phylum'])
table(ps.clean2@tax_table@.Data[,'Genus'])

Ordination

Following the tutorial on distances. Note that here I use MDS, while the plot produced above was done using NMDS.

dist_methods <- unlist(distanceMethodList)
# in addition to designdist, I remove those that cause errors:
dist_methods <- dist_methods[! names(dist_methods) %in% c('vegdist1', 'designdist', 'vegdist2', 'dist3')]
plist <- vector("list", length(dist_methods))
names(plist) <- dist_methods
# Abundance should be integer, but standardized.
total <- median(sample_sums(ps))
standf <- function(x, t=total) round(t * (x / sum(x)))
ps.used <- transform_sample_counts(ps, standf, t=median(sample_sums(ps)))
#ps.used <- ps.clean2
for( i in dist_methods ){
    # Calculate distance matrix
    iDist <- phyloseq::distance(ps.used, method=i)
    # Calculate ordination
    iMDS  <- ordinate(ps.used, "MDS", distance=iDist)
    ## Make plot
    # Don't carry over previous plot (if error, p will be blank)
    p <- NULL
    # Create plot, store as temp variable, p
    p <- plot_ordination(ps.used, iMDS, color="age")
    # Add title to each plot
    p <- p + ggtitle(paste("MDS using distance method ", i, sep=""))
    # Save the graphic to file.
    plist[[i]] = p
}
## Warning in UniFrac(physeq, ...): Randomly assigning root as -- A2312 -- in the
## phylogenetic tree in the data you provided.
## Warning in UniFrac(physeq, weighted = TRUE, ...): Randomly assigning root as --
## A1979 -- in the phylogenetic tree in the data you provided.
## Warning in is.euclid(patristicDist): Zero distance(s)
## Warning in is.euclid(dis): Zero distance(s)
df = ldply(plist, function(x) x$data)
names(df)[1] <- "Distance"
p <- ggplot(df, aes(Axis.1, Axis.2, color=age))
p <- p + geom_point(size=3, alpha=0.5)
p <- p + facet_wrap(~Distance, scales="free")
p <- p + ggtitle("MDS on various distance metrics")
p

In the plot above, the ordinations that separate better between old and young microbiomes seem to be those that use one of the distance methods implemented in the betadiver() function of the vegan package, such as -1, -2, -3, 19, cc, e, g, hkbetadiver() calculates either \(\Beta\) diversity indices or distances among local paterns of presence-absence of species. The binary distance, from the dist() function also works well, and it also treats abundances as presence-absence, actually.

When using ps.prop, with 1236 taxa, samples are much better separate in early and late ages than if using the reduced ps.clean2 dataset.

Taken together, these results suggest that differences between young and old microbiomes are subtle and can only be appreciated when looking at the presence or absence of both low and high frequency taxa. To put it another way, they do not differ much in the most abundant taxa.

The same is true for the batch effect. The same distance methods that separate young and old microbiomes also distinguish quite well between the two sequencing runs. Curiously, the discrimination between sequencing runs is almos orthogonal to that of young and old samples. That is, the two first axes of ordinations based on presence-absence distances correspond to the two main drivers of diversity among our samples: age and batch effect. Axis 1 mostly corresponds with the age of the samples, and Axis 2, with the sequencing batch. Although a small counter-clock wise rotation would help the interpretation of the axes.

Just to illustrate the point:

p <- ggplot(df[df$Distance == 'binary',], aes(Axis.1, Axis.2, color=age))
p <- p + geom_point(size=3, alpha=0.5)
p <- p + ggtitle("MDS on a binary distance. Coloring by age")
p

p <- ggplot(df[df$Distance == 'binary',], aes(Axis.1, Axis.2, color=seqrun))
p <- p + geom_point(size=3, alpha=0.5)
p <- p + ggtitle("MDS on a binary distance. Coloring by sequencing run")
p

Rotation

I found that the recluster package can rotate the coordinates of an MDS to make a line be horizontal. Ideally, the line would be a the boundary decision of a linear discriminant analysis. But I have not found a way to extract the linear expression of the boundary from the LDA result. Below, I use a manual approximation.

library(recluster)
## Loading required package: vegan
## Loading required package: permute
## Loading required package: lattice
## This is vegan 2.5-6
## 
## Attaching package: 'vegan'
## The following objects are masked from 'package:phangorn':
## 
##     diversity, treedist
BinaryMDS <- df[df$Distance == 'binary',]
tilted <- recluster.rotate(BinaryMDS[,c('Axis.1','Axis.2')], m=-0.47, q=-0.027)
BinaryMDS$Rotated1 <- tilted$Axis.1
BinaryMDS$Rotated2 <- tilted$Axis.2
rm(tilted)
ggplot(BinaryMDS, aes(x=Rotated1, y=Rotated2, color=seqrun)) +
  geom_point(size=3, alpha=0.5) +
  ggtitle("Slightly rotated MDS on binary distance")

ggplot(BinaryMDS, aes(x=Rotated1, y=Rotated2, color=age)) +
  geom_point(size=3, alpha=0.5)

Acetobacter/Lactobacillus ratio

Lactobacillus <- AmpFilt[AmpFilt %in% row.names(taxa[taxa[,'Genus'] %in% 'Lactobacillus',])]
Acetobacter   <- AmpFilt[AmpFilt %in% row.names(taxa[taxa[,'Genus'] %in% 'Acetobacter',])]
LactoProp <- colSums(Proportions[Lactobacillus,])
AcetoProp <- colSums(Proportions[Acetobacter,])
summary(LactoProp)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## 0.000e+00 4.226e-05 8.079e-04 4.540e-03 4.081e-03 8.693e-02
summary(AcetoProp)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.04318 0.92323 0.97939 0.88894 0.99129 0.99765
SampleData$LactoAcetoRatio <- LactoProp / AcetoProp
# It includes 0, but not Inf. 
ggplot(SampleData, aes(x=age, y=LactoAcetoRatio)) + 
  geom_boxplot(notch=TRUE) + scale_y_log10()
## Warning: Transformation introduced infinite values in continuous y-axis
## Warning: Removed 33 rows containing non-finite values (stat_boxplot).

ggplot(SampleData, aes(x=seqrun, y=LactoAcetoRatio)) +
  geom_boxplot(notch=TRUE) + scale_y_log10()
## Warning: Transformation introduced infinite values in continuous y-axis

## Warning: Removed 33 rows containing non-finite values (stat_boxplot).

Export data

Here I produce a csv file with the SampleData dataframe, which includes Simpson and Shannon’s diversity indices and the rotated coordinates of the MDS produced with the binary distance. The rotation is meant to make the MDS axis more readily interpretable as age (axis 1) and batch (axis 2) effects. I also include the original axis, just in case.

# Actually, samples are in the same order in SampleData and BinaryMDS.
SampleData$RotAxis1 <- BinaryMDS$Rotated1
SampleData$RotAxis2 <- BinaryMDS$Rotated2
SampleData$OriAxis1 <- BinaryMDS$Axis.1
SampleData$OriAxis2 <- BinaryMDS$Axis.2
write.csv(SampleData, file='DiversityAbundance.csv', quote=FALSE)
sessionInfo()
## R version 3.6.3 (2020-02-29)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.4 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats4    parallel  stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] recluster_2.8       vegan_2.5-6         lattice_0.20-41    
##  [4] permute_0.9-5       plyr_1.8.6          phangorn_2.5.5     
##  [7] ape_5.3             DECIPHER_2.14.0     RSQLite_2.2.0      
## [10] Biostrings_2.54.0   XVector_0.26.0      IRanges_2.20.2     
## [13] S4Vectors_0.24.4    BiocGenerics_0.32.0 gridExtra_2.3      
## [16] ggplot2_3.3.0       phyloseq_1.30.0    
## 
## loaded via a namespace (and not attached):
##  [1] Biobase_2.46.0          maps_3.3.0              bit64_0.9-7            
##  [4] jsonlite_1.6.1          splines_3.6.3           foreach_1.5.0          
##  [7] gtools_3.8.2            assertthat_0.2.1        expm_0.999-4           
## [10] animation_2.6           blob_1.2.1              yaml_2.2.1             
## [13] numDeriv_2016.8-1.1     pillar_1.4.3            glue_1.4.0             
## [16] quadprog_1.5-8          digest_0.6.25           picante_1.8.1          
## [19] colorspace_1.4-1        htmltools_0.4.0         Matrix_1.2-18          
## [22] pkgconfig_2.0.3         zlibbioc_1.32.0         purrr_0.3.4            
## [25] scales_1.1.0            combinat_0.0-8          tibble_3.0.1           
## [28] mgcv_1.8-31             farver_2.0.3            ellipsis_0.3.0         
## [31] withr_2.2.0             mnormt_1.5-6            survival_3.1-12        
## [34] magrittr_1.5            crayon_1.3.4            memoise_1.1.0          
## [37] evaluate_0.14           nlme_3.1-147            MASS_7.3-51.5          
## [40] tools_3.6.3             data.table_1.12.8       lifecycle_0.2.0        
## [43] stringr_1.4.0           phytools_0.7-20         Rhdf5lib_1.8.0         
## [46] munsell_0.5.0           plotrix_3.7-8           cluster_2.1.0          
## [49] ade4_1.7-15             compiler_3.6.3          clusterGeneration_1.3.4
## [52] rlang_0.4.5             rhdf5_2.30.1            grid_3.6.3             
## [55] iterators_1.0.12        biomformat_1.14.0       igraph_1.2.5           
## [58] labeling_0.3            rmarkdown_2.1           gtable_0.3.0           
## [61] codetools_0.2-16        multtest_2.42.0         DBI_1.1.0              
## [64] reshape2_1.4.4          R6_2.4.1                knitr_1.28             
## [67] dplyr_0.8.5             bit_1.1-15.2            fastmatch_1.1-0        
## [70] stringi_1.4.6           Rcpp_1.0.4.6            vctrs_0.2.4            
## [73] coda_0.19-3             scatterplot3d_0.3-41    tidyselect_1.0.0       
## [76] xfun_0.13

References

Callahan, B.J., K. Sankaran, J.A. Fukuyama, P.J. McMurdie, and S.P. Holmes. 2016. “Bioconductor Workflow for Microbiome Data Analysis: From Raw Reads to Community Analyses.” F1000Research 5: 1492. https://doi.org/https://doi.org/10.12688/f1000research.8986.2.

McMurdie, P.J., and Susan Holmes. 2013. “Phyloseq: An R Package for Reproducible Interactive Analysis and Graphics of Microbiome Census Data.” PLoS ONE 8 (4): e61217. https://doi.org/https://doi.org/10.1371/journal.pone.0061217.